AITopics | low-resource nlp

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

Remy, François, Delobelle, Pieter, Avetisyan, Hayastan, Khabibullina, Alfiya, de Lhoneux, Miryam, Demeester, Thomas

arXiv.org Artificial Intelligence, Aug-8-2024

The development of monolingual language models for low- and mid-resource languages continues to be hindered by the difficulty of sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
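The embedding initialization described in the abstract (each target-language token starts from a weighted average of semantically similar source-language token embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `init_target_embeddings`, the `alignment` dictionary format, and the fallback to the mean source embedding for unaligned tokens are all assumptions made for illustration; the abstract only states that the weights are derived from a translation resource covering both languages.

```python
import numpy as np

def init_target_embeddings(source_emb, alignment, n_target):
    """Initialize target-token embeddings from source-token embeddings.

    source_emb: array of shape (n_source, dim) with the source model's embeddings.
    alignment:  hypothetical mapping target_id -> {source_id: weight}, e.g.
                alignment counts gathered from a translation resource.
    n_target:   size of the target vocabulary.
    """
    # Assumption: tokens without any alignment start at the mean source embedding.
    fallback = source_emb.mean(axis=0)
    target_emb = np.tile(fallback, (n_target, 1))

    for target_id, weights in alignment.items():
        source_ids = np.array(list(weights.keys()))
        w = np.array(list(weights.values()), dtype=float)
        total = w.sum()
        if total <= 0:
            continue  # keep the fallback when the weights carry no information
        # Weighted average of the aligned source embeddings.
        target_emb[target_id] = (w / total) @ source_emb[source_ids]

    return target_emb


# Toy usage: 3 source tokens, 3 target tokens, 2-dimensional embeddings.
source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alignment = {0: {0: 3, 1: 1}, 1: {2: 1}}  # target token 2 has no alignment
target_emb = init_target_embeddings(source_emb, alignment, n_target=3)
print(target_emb)
```

In the toy example, target token 0 is aligned three times to source token 0 and once to source token 1, so its initial embedding is 0.75 times the first source vector plus 0.25 times the second.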

language adaptation, low-resource nlp, trans-tokenization and cross-lingual vocabulary transfer, (1 more...)

arXiv:2408.04303

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

A Visual Guide to Low-Resource NLP

#artificialintelligence, Sep-1-2021, 05:46:37 GMT

Deep neural networks are becoming omnipresent in natural language processing (NLP) applications. However, they require large amounts of labeled training data, which is often available only for English. This is a major challenge for the many languages and domains where labeled data is limited. In recent years, a variety of methods have been proposed to address this problem. This article gives an overview of approaches that help you train NLP models in resource-lean scenarios.

language model, low-resource nlp, low-resource scenario, (12 more...)

Country:

North America > Mexico > Mexico City > Mexico City (0.05)
North America > United States (0.05)

Genre: Overview (0.79)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.55)
